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Abstract. In signal detection problems, one is usually faced with the task of searching a 
^ . parameter space for peaks in the likelihood function which indicate the presence of a signal. 

Random searches have proven to be very efficient as well as easy to implement, compared e.g. 
to searches along regular grids in parameter space. Knowledge of the parameterised shape of 
£ — ■ the signal searched for adds structure to the parameter space, i.e., there are usually regions 

requiring to be densely searched while in other regions a coarser search is sufficient. On the 
other hand, prior information identifies the regions in which a search will actually be promising 
or may likely be in vain. Defining specific figures of merit allows one to combine both template 
metric and prior distribution and devise optimal sampling schemes over the parameter space. 
We show an example related to the gravitational wave signal from a binary inspiral event. Here 
the template metric and prior information are particularly contradictory, since signals from 
low-mass systems tolerate the least mismatch in parameter space while high-mass systems are 
£NJ ■ far more likely, as they imply a greater signal-to-noise ratio (SNR) and hence are detectable 

^ ' to greater distances. The derived sampling strategy is implemented in a Markov chain Monte 

Carlo (MCMC) algorithm where it improves convergence. 
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1. Introduction 

Signal detection, in gravitational wave detection in particular, frequently entails the problem 
of performing a computationally expensive numerical search over a large parameter space. 
The search here means a search for a peak in the likelihood function, or another detection 
^ ' statistic, based on the data at hand and varying the unknown signal parameters. A peak or a 

threshold excess then indicates the presence of a signal [1, 2]. Such "brute- force" searches may 
be implemented as grid searches, evaluating the detection statistic at regularly placed points in 
parameter space. Computing the detection statistic usually means evaluating the match between 
a signal template and the data; the spacing between evaluated points in parameter space is then 
usually based on a template metric which ensures that all possible signals (corresponding to 
points in parameter space) have at least a certain minimal match with one of the evaluated 
templates (corresponding to the grid points). Instead of using regularly spaced template banks 
(e.g. [3, 4, 5, 6]), the use of random template hanks has recently gained popularity, as these 
are often very easily implemented, and have also be shown to be very efficient, especially in 
higher dimensions [7, 8, 9, 10]. Here the idea is to populate the parameter space randomly, but 
uniformly with respect to the template metric. 

These template placement strategies have by now usually been based on "minimax" reasoning, 
by aiming at minimizing the maximal (worst-case) mismatch across the whole parameter space. 
Once one takes prior information on the unknown parameters into consideration, by accounting 
for a priori probabilities attached to different regions of parameter space, a decision-theoretic 



approach allows us to devise other strategies, effectively concentrating efforts on the more 
promising regions of parameter space in pursuit of a certain optimality criterion [11, 12]. In 
fact, a minimax strategy may often only exist once one imposes hard bounds on the parameter 
space (and by that ensuring the existence of an absolute worst case). 

Markov chain Monte Carlo (MCMC) methods are meanwhile widely used for (Bayesian) 
parameter estimation in the signal processing stage for gravitational-wave signals [13, 14, 15]. 
MCMC algorithms are, first of all, methods for stochastic integration [16, 17], although by the 
way they work they often behave similarly to stochastic search algorithms as well. This is 
in fact a most welcome property, as part of the parameter estimation problem is usually also a 
search/optimization problem, as, besides integration over the parameters' posterior distribution, 
it requires finding the global mode or secondary modes. Parallel tempering [18, 19] is a 
variety of the Metropolis-Hastings MCMC algorithm (and a special case of Metropolis-coupled 
MCMC algorithm [20, 17]) aimed at enhancing these stochastic search capabilities. This is 
done by basically running several MCMC chains in parallel, where tempering at increasing 
temperature values is applied to subsequent chains (as in simulated annealing methods [21, 22]), 
and additional steps are introduced to allow for communication between chains [23]. Parallel 
tempering methods have been applied to gravitational-wave data analysis for binary inspiral 
signals in the context of ground-based [24] and space-based (LISA) measurements [23], where 
they have proven advantageous especially in cases of high SNR and of posterior distributions 
exhibiting multiple modes or degeneracies [25, 26]; they have also been adopted for the analysis 
of burst signals [27] and utilized for Bayesian evidence computation [28]. 

Among the parallel Markov chains being run at different 'temperatures' within the parallel 
tempering implementation, the 'cool' ones with no tempering applied produce samples from the 
posterior distribution for the stochastic integration part, while the high-temperature chains are 
producing samples for the stochastic search. The question now is how to set up the algorithm 
so that the search is most efficient, given our knowledge of prior and template metric, i.e., our 
knowledge of "where the true parameters are (un-) likely to be", and "how hard one needs to 
look" across the parameter space. The problem is of special interest in the context of binary 
inspiral signals, as prior and template metric are particularly contradictory: a priori one is most 
likely do detect an inspiral involving high masses, as these result in a high-SNR signal that is 
detectable to a greater distance. On the other hand, considering the template metric only, one 
might want to mostly try low-mass templates, since at low masses the template's and true signal 
parameters need to be in very close agreement in order for them to match, while at high masses 
greater discrepancies still yield a good match. What needs to be defined is the distribution to 
sample from in order to find the mode(s) fastest, which is very similar to setting up a random 
template bank, the difference being that one does not settle on some fixed number of templates, 
as the MCMC sampler in principle is thought to sample indefinitely. 

In the following Sec. 2, we will introduce the problem for the case of binary inspiral signals, 
and Sec. 3 briefly introduces the parallel tempering context. In Sec. 4 the general problem is 
formulated in decision-theoretic terms and solved for a particular optimality criterion. Sec. 5 
shows some illustrative examples, and Sec. 6 eventually closes with conclusions and perspectives. 

2. Binary inspiral parameters 

In the simplest description, a binary inspiral signal as measured by ground-based interferometers 
is determined by 9 parameters: sky location (declination 5, right ascension a), polaristion 
(tp), companion masses (mi, m2), luminosity distance (d£), time of arrival (t c ), phase {(f) 
and inclination angle (i). Assuming some prior distribution for the masses (in the following 
simply defined to be uniform, mi,ro2 £ [1M Q ,10M Q ]), and an isotropic distribution of events 
across space while folding in the detectability as a function of signal-to-noise ratio (SNR), one 
can derive a joint prior distribution whose marginal distribution of masses is shown in Fig. 1 
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Figure 1. (Marginal) densities of the distributions it, p and p* for the two mass parameters 
(mi, 1712) of a binary inspiral signal. The prior (left plot) indicates that high masses are most 
likely, which is because they result in stronger signals that are detectable to greater distances. 
The template metric on the other hand implies that low masses require a dense template spacing 
(middle plot). 



[29, 23]. A template metric may be defined following [30, 31], assuming the metric to be constant 
in the space of the Newtonian and 1.5 PN chirp times X\ and A2, which are functions of the mass 
parameters. For the remaining parameters, for now, we again assume the metric to be uniform 
(t c , log ((if,)) and isotropic (5, a, tp, 1, </>). The implied distribution in terms of (mi, 771,2) following 
from a uniform spacing in (Ai, A2) may be derived using the reparametrisation explicated in [32]. 
This distribution is shown in Fig. 1. 

3. Parallel tempering 

In the context of Monte Carlo integration, tempering is utilised to prevent the integration 
algorithm from getting stuck in local modes of the distribution from which it is sampling. A 
temperature parameter T > 1 is introduced, and instead of sampling from the distribution of 
actual interest, with density function f(9), the modified distribution 

f [T) (9) oc f(9)r (1) 

is used. The introduced exponent is supposed to make the distribution more tractable, as it 
has a "flattening" effect on the density; the same effect is also taken advantage of in simulated 
annealing methods [22]. In the limit of T — ► 00, the density f(r){9) then approaches a uniform 
distribution [17]. In the context of posterior inference, when the target distribution f{6) is 
the product of prior tt(0) and likelihood C(9), it may be more sensible to use a scheme only 
tempering the likelihood part: 

f (T) (9) oc 7t(9)£(9)±, (2) 

in which case f(r){9) goes towards the prior tt(9) for T — > 00 [23]. Both uniform distribution 
and prior distribution may in general not be the most sensible choice, as was pointed out above, 
since the tempering is also supposed to enhance the algorithm's stochastic search properties. 
Assume that one had a distribution p*(9) available, which leads to an optimal sampling (w.r.t. 
to some pre-specified criterion), and which is then the desired density for T — > 00. This suggests 
a generalized tempering parametrisation: 

f (T) (9) oc p*(0) (^) T = P*(*) 1- * /(*)* (3) 



which in the special cases of p*(0) oc 1 and p*(8) = tt(9) again yields the tempering schemes 
from (1) and (2) above. The question now is how to choose such a limiting distribution p*(8) 
based on given prior information and template metric. 



4. The decision theoretic approach 

Let g(0) be the determinant of the template metric as a function of the signal parameters. A 
large value of g means that that templates need to be densely spaced around 9, while a smaller 
g indicates that a coarser spacing is sufficient. The volume "covered" by a template placed at 
parameter is proportional to g(9)~2, and hence the probability density to sample from for 
setting up a random template bank is given by p(9) oc \Jg{9) [8]. 

Now consider the case of the true parameter value being 9q £ @. The actual value 9q is 
unknown, what is known is the prior probability density tt(9). Whenever a template 9* is placed 
in parameter space, it is considered a match if it was sufficiently close to the true value 9q. 
What exactly is "sufficiently close" is determined via mismatch considerations and is expressed 
through the template metric. Then the probability of a match is 

P(match|0*) = c _L^(r), (4) 

where c G H + is a constant depending on how close a match actually is required to be. If one 
was to pick a single template 9*, the chances for success would obviously be maximal where the 
above product reaches its maximum. Analogously, consider the case of a given true value 9q and 
repeated, independent "guesses" drawn from p*{9). Then for each single guess the probability 
of success is 

P(match|# ) = c-^=p*(9 ). (5) 

What is desired is a distribution p* from which to generate independent draws so that the 
chances of getting a match are "optimal". Whether or when one will get a match is a matter 
of chance, depending on both the true value 9o £ @ and the choice of p* 6 V* , where P* is the 
space of probability distributions over 0. Suppose we are interested in minimizing the expected 
number of trials T (or waiting' time) until the first match. Any choice of p* implies a probability 
distribution for T; for a given true value 9q an d a sampling distribution p*, T follows a geometric 
distribution with density and expectation: 



P(T = t\9 ) = (i-c-^p*^)) ( c ^^*(0 o) ), E[T\9 ] = j — -. (6) 



In decision theoretic terms, we are given a state-of-nature space 0, an action space V*, and 
a loss function L : x V* — ► 1R with L(9q,p*) = E p * [T\9q] [11, 12]. An optimal choice of p* may 
now be determined by minimizing the expected loss; integrating over the possible values that #o 
could take, that (prior) expectation is 

m = - I i * ^ irWdfl, (7) 



which is minimized by choosing 



(8) 



i.e., the optimal p* here is proportional to the geometric mean of tt and p, and independent of c. 




Figure 2. Densities of the distributions ir, p and p* for the toy example discussed in Sec. 5.2. 



The distribution defined through the density p that is usually utilized for random template 
banks [8] plays a particular role in this context. From equation (5) one can see that by setting 
p* := p the probability of a match (and with that also the waiting time) becomes independent 
of the actual parameter value 9q, so that p constitutes an equalizer rule. From (8) it follows 
that p will be optimal in the case that the prior happens to be ir = p. This implies that ir = p 
defines the "least favourable prior distribution 11 for this case, and that p* = p also constitutes the 
minimax strategy (independent from the particular prior ir), as it minimizes the maximum of 
E[T | 9q] across all possible true values 9q [11]. Since p* =p leads to a uniform match probability 
in (5), it actually constitutes the equalizer rule for the wider family of optimality criteria that 
are functions of P (match | 9q). 



5. Examples 

5. 1 . Toy example 1 : Gaussian prior 

Consider a parameter space = II where the prior is Gaussian with mean p and variance a 2 : 
ir = N(/x, a 2 ), and the template metric is Rat, i.e., g{6) = 7 is independent of 9. Then the 
equalizer rule p does not exist, and the optimal rule would be p* = N(/x, 2a 2 ). 



5.2. Toy example 2: Numerical simulation 

Consider a parameter space = [0,1], where the prior and template metric behave as shown 
in Fig. 2. For this simple case the behaviour of different sampling strategies can be simulated 
numerically, by drawing "true" parameter values #0 from the prior distribution and then drawing 
"guesses" 9* from either p or p* in order to see how the strategies differ. 
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Figure 3. Cumulative distri- 
butions of the resulting wait- 
ing times T when using sam- 
pling strategies p* and p in the 
toy example of Sec. 5.2. The 
right panel shows a zoom-in on 
the differing tail behaviour. 



Fig. 3 illustrates the distribution of the resulting times T, for both the minimax and optimal 
strategies p and p*. As expected, the average waiting time is lower for p*, and one can see that 
the minimax strategy performs better in the unlikely "worst cases" . 




Figure 4. This plot illustrates the behaviour of a Parallel Tempering algorithm utilizing the 
distribution p* when running on simulated data. The left panel shows how the algorithm's 'cool' 
chains manage to ascend to greater likelihood values while the tempered chains keep sampling 
at lower likelihood values. The 2nd panel is a scatter plot of mass parameter samples from all 
the different chains (after the algorithm's burn- in phase). The right panel eventually shows the 
resulting mass parameters' marginal posterior density derived from the 'cool' chain $4 alone; 
the cross indicates the true parameter value. 



5.3. Binary inspiral example 

The prior tt and minimax sampling rule p for the mass parameters of a binary inspiral event were 
shown in Fig. 1. The right panel of the same figure also shows the resulting optimized sampling 
distribution p*. The obvious discrepancy between least favourable (/>) and actual prior (tt) 
suggests that there actually is a gain in doing the optimization. Fig. 4 shows how a parallel 
tempering algorithm for parameter estimation behaves when utilizing the distribution p* for 
high-temperature chains as described in Sec. 3 (3). The MCMC chains quickly converge to the 
true parameter values, while the higher-temperature chains keep scanning the parameter space 
efficiently. 

6. Conclusions and outlook 

We have applied a decision-theoretic approach in order to derive an optimized sampling 
distribution to be used within a parallel tempering MCMC implementation. The optimization 
step here provides a natural link between the parameter space metric and the prior information 
about the parameter values. The particular optimality criterion chosen here (the expected time 
until a matching template is found, E[T|#o]) turns out to be computationally convenient, as 
the resulting sampling distribution p* is independent of the particular mismatch threshold c, 
and is almost trivial to implement within an MCMC application. Other criteria are conceivable 
though, like the probability of a missed detection within N samples P(T> N\8q) for example, 
which may then lead to more complicated results. 

The general approach used here should also be useful in other contexts; it turns out that the 
distribution usually used for setting up random template banks here constitutes the special case 
of a minimax strategy, which implies that the explicit specification of particular figures-of-merit 
and the consideration of prior information may yield great efficiency improvements, especially 
in cases where the implicitly assumed least favourable prior greatly deviates from the actual 
prior information as in the binary inspiral case. In the framework discussed above, the resulting 
optimized sampling distribution p* even exists for cases where the minimax rule does not (as 
in the example of Sec. 5.1 above). This suggests that a similar approach may also make other 
ad-hoc fixes like the mass parameter bounds in the binary inspiral example dispensable, as it 
would naturally focus in on the promising parameter range while ruling out too unlikely and 



too costly regions of parameter space. 
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